OPEN ACCESS Freely available online

PLOS ONE



A Prostate Cancer Model Build by a Novel SVM-ID3 
Hybrid Feature Selection Method Using Both Genotyping 
and Phenotype Data from dbGaP 

Sait Can Yücebaş 1, Yeşim Aydın Son 1,2 *

1 Medical Informatics Department, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey, 2 Bioinformatics Graduate Program, Graduate School
of Informatics, Middle East Technical University, Ankara, Turkey



Abstract 

Through Genome Wide Association Studies (GWAS), many Single Nucleotide Polymorphism (SNP)-complex disease relations
can be investigated. GWAS output is large in volume and high dimensional, and the relations between SNPs,
phenotypes and diseases are most likely nonlinear. To handle high-volume, high-dimensional data and to find
these nonlinear relations, we used data mining approaches and designed a hybrid feature selection model that
combines a support vector machine and a decision tree. The designed model is tested on prostate cancer data, and for
the first time combined genotype and phenotype information is used to increase the diagnostic performance. We were able
to select phenotypic features such as ethnicity and body mass index, and SNPs that map to specific genes such as CRR9 and
TERT. The performance of the proposed hybrid model on the prostate cancer dataset, with 90.92% sensitivity and 0.91
area under the ROC curve, shows the potential of the approach for prediction and early detection of prostate cancer.

Introduction

In Genome Wide Association Studies (GWAS), associations between
Single Nucleotide Polymorphisms (SNPs) and complex diseases are
searched; examples include age-related macular degeneration [1],
heart diseases [2], diabetes [3], rheumatoid arthritis [4], Crohn's
disease [5], hypertension [6], multiple sclerosis [7], cancer types
[8-10], neurodegenerative diseases [11] and psychiatric diseases
such as bipolar disorder [12]. Current GWAS of SNP profiles for
such chronic and complex diseases are leading to the discovery of
genetic loci and individual SNPs related to these conditions, but
associations based only on SNP genotyping profiles are not strong
enough to predict the disease condition. This study is therefore
designed to test whether, and to what degree, integrating genotype
profiles with phenotypic features (demographic information,
environmental factors and lifestyle habits, along with the clinical
findings of a patient) strengthens the predictive performance of
disease models. So far there is no publication that combines
multiple genotypic and multiple phenotypic features; doing so
requires new data mining approaches that can handle data with such
different characteristics and even higher dimensionality.

Methods used in GWAS can be grouped under two main categories,
parametric and non-parametric [13]. Non-parametric methods do not
require a genetic model given beforehand; instead they build their
own models from the given data by using data mining and machine
learning [13]. Non-parametric methods are preferred because of the
high dimensionality of genetic data, for which traditional
statistical methods are not sufficient [14]. Almost all known
machine learning algorithms have been used in GWAS; the foremost
methods are Decision Trees [15-16], Artificial Neural Networks
[16], Bayesian Belief Networks [17], Support Vector Machines
[18-20] and Genetic Algorithms [21]. For the analysis of genotyping
data, the various applications of data mining give no clear
evidence that any one method performs better than the others [13].
All methods have their own advantages and disadvantages, and the
selection of the appropriate method depends mostly on the given
problem, data type, study design and aim of the work. There are
also a few examples of hybrid data mining approaches applied to
GWAS data to increase the predictive performance, in which one main
method is selected and genetic-based algorithms are used as a
second step to optimize the main method [22].

Here, for the first time, we introduce a hybrid feature selection
model combining two non-parametric data mining methods, SVM and
ID3, to determine the most predictive phenotypic and genotypic
features related to a complex disease. Unlike many works in the
literature, in this study we have used both methods individually
rather than just optimizing the main method. Prostate cancer data
is used as a case study, and we demonstrate that combining genotype
information with phenotypes gives better predictive performance
than using only genotypes or only phenotypes in disease diagnosis,
while exceeding the performance of the prostate specific antigen
(PSA) screening test [23].

Materials and Methods

Prostate Cancer Data Set 

The dataset used in this work, "Multi Ethnic Genome Wide Scan of
Prostate Cancer", was downloaded from NCBI's dbGaP database under
accession number phs000306, version 2. The data consist of 4650
cases and 4795 controls from three ethnic groups: African
Americans, Latinos and Japanese. Each individual in the study has
600,000 SNPs and 20 phenotypes, and the number of subjects with
both phenotypic and genotypic attributes is 9130.

Data Preprocessing 

Data preprocessing consisted of three steps. In the first step a
PLINK analysis was conducted to measure the statistical strength of
the associations between the genotypes and the disease. The
threshold for the association of the SNPs with prostate cancer was
set to p<0.005 after the GWAS, and the 22,848 SNPs satisfying this
condition formed the first representative subset. In the second
step the AHP (Analytical Hierarchical Process) feature of METU-SNP
was used to prioritize SNPs by biological and statistical
significance, which filtered the associated SNPs down to 2710 SNPs.
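
The p-value filter that forms the first SNP subset can be sketched as follows. This is a minimal illustration and not the PLINK workflow itself: the function name `filter_snps`, the `(rsID, p-value)` tuple layout and the example records are assumptions made for this sketch; only the p<0.005 threshold comes from the text.

```python
P_VALUE_THRESHOLD = 0.005  # association threshold stated in the text


def filter_snps(assoc_results, threshold=P_VALUE_THRESHOLD):
    """Keep the rsIDs whose association p-value is strictly below threshold."""
    return [rsid for rsid, p_value in assoc_results if p_value < threshold]


# Made-up records for illustration only (not values from the study).
example = [("rs_demo_1", 0.0004), ("rs_demo_2", 0.02), ("rs_demo_3", 0.005)]
selected = filter_snps(example)
```

The comparison is strict, so a SNP whose p-value equals the threshold is excluded, matching the "p<0.005" wording.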

Data matching, cleaning and transformation were done in the final
step of data preprocessing. In the data matching step, the
genotypic and phenotypic attributes of the subjects were combined
using the subject IDs and the subject ID conversions given in the
manifest data. In the cleaning phase, missing values in phenotypic
attributes were replaced by the class mean, and the attribute was
deleted where a class mean could not be calculated. Data
transformation was needed to code the alleles because SVMs use
numerical values instead of categorical ones. In the literature,
allele combinations are coded by three numeric values based on the
heterozygous and homozygous major alleles [18]. The disadvantage of
these schemes is that "the alleles are not treated symmetrically
[18]". As the parent of origin was not indicated in our data, we
used an alternative coding scheme in which symmetric allele pairs
(for example AT and TA) are treated in the same manner. This
coding scheme is presented in Table 1.
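
The symmetric coding of Table 1 can be expressed as a lookup on unordered allele pairs. The sketch below is an illustration only: the names `ALLELE_CODES` and `encode_genotype` are hypothetical, and the coding values are copied from Table 1.

```python
# Coding values copied from Table 1. Keys are unordered allele sets, so
# AT and TA (and any other mirrored pair) share one code.
ALLELE_CODES = {
    frozenset("A"): 1,    # AA
    frozenset("AT"): 2,   # AT/TA
    frozenset("AC"): 3,   # AC/CA
    frozenset("AG"): 4,   # AG/GA
    frozenset("T"): 5,    # TT
    frozenset("CT"): 6,   # CT/TC
    frozenset("GT"): 7,   # GT/TG
    frozenset("C"): 8,    # CC
    frozenset("GC"): 9,   # GC/CG
    frozenset("G"): 10,   # GG
}


def encode_genotype(genotype: str) -> int:
    """Map a two-letter genotype such as 'TA' to its Table 1 coding value."""
    alleles = genotype.strip().upper()
    if len(alleles) != 2:
        raise ValueError(f"expected two alleles, got {genotype!r}")
    try:
        return ALLELE_CODES[frozenset(alleles)]
    except KeyError:
        raise ValueError(f"unknown alleles in {genotype!r}") from None
```

Because the key is a set, `encode_genotype("AT")` and `encode_genotype("TA")` return the same value, which is the symmetry property the scheme is meant to provide.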

Analysis 

According to the literature, the most widely used algorithms for
detecting relations between genotype information and disease are
ANN, SVM and Decision Trees. There are also examples of different
data mining approaches applied in a hybrid manner to increase the
predictive performance, where one main method is selected and
genetic-based algorithms are used as a second step to optimize the
main method [15-22].

Table 1. Major allele coding scheme.

Major Alleles    Coding Value
AA               1
AT/TA            2
AC/CA            3
AG/GA            4
TT               5
CT/TC            6
GT/TG            7
CC               8
GC/CG            9
GG               10

doi:10.1371/journal.pone.0091404.t001

In our model we combined two different methods, SVM and ID3, and
applied an appropriate optimization to each, rather than combining
a main method with an advanced optimization as described above. In
this way, instead of benefitting from one strong method, we
combined the strengths of different methodologies: ID3's robustness
to noise and outliers [24] and its power to handle non-linear
problems, and SVM's prediction performance on non-linear binary
classification problems. Both methods are also more interpretable
than other methods.

Our SVM-ID3 Hybrid Model was constructed in RapidMiner 5.0, a free
open source software tool for data mining that has been used in
various applications in the literature, such as [25]. For the SVM
phase the RBF kernel was chosen. This kernel is widely used in GWAS
[19] and was preferred in our study for its faster learning speed
and because it can behave as both a linear kernel and a sigmoid
kernel under some special conditions [26]. Besides the kernel
function, SVM has two important parameters (C, γ) which, if not
adjusted well, can cause overfitting or underfitting. The C
constant adjusts the margin of the hyperplane that separates the
classes, and the gamma parameter gives the decision boundary its
shape. Optimization of these parameters has been reported
previously [27], and we applied the grid search approach for the
optimization, which has been described previously [28]. The value
ranges for C and gamma used during the grid search were decided
based on the literature [27] along with our own experience with the
data. For gamma the value range was [0.0001, 100] in powers of ten,
and for C the value range was [0, 10] in five linear steps. The
grid search for SVM optimization took around ten hours on a system
with 16 GB of memory and a 3.4 GHz Intel Core i7 processor, and
covered 42 combinations.
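
The count of 42 combinations can be checked by enumerating the grid. The sketch below assumes that "five linear steps" means six evenly spaced C values including both endpoints, and that "powers of ten" means one gamma value per decade; these readings, and the variable names, are assumptions of this illustration rather than the RapidMiner configuration itself.

```python
from itertools import product

# Gamma: one value per power of ten from 0.0001 to 100 (7 values).
gamma_values = [10.0 ** exponent for exponent in range(-4, 3)]

# C: five linear steps over [0, 10], i.e. six points 0, 2, 4, 6, 8, 10.
c_values = [i * 10 / 5 for i in range(6)]

# Every (C, gamma) pair that the grid search would evaluate.
grid = list(product(c_values, gamma_values))
```

Under these assumptions the grid has 6 x 7 = 42 points, which agrees with the number of combinations reported in the text.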

In the literature there are various studies that combine SVMs and
decision trees. Although previously published hybrid models of SVM
and decision trees (SVM-DT) are generally used for
multi-classification and multi-clustering problems, there are also
examples of SVM-DT combinations used for binary classification
problems [29]. In all of these SVM-DT models, SVM is applied first
in order to optimize the parameters and the datasets to be used
next in the decision tree. In our study we also applied SVM in the
first step; however, instead of ranking the attributes and
selecting the top-listed ones according to their SVM weights, which
risks loss of information, we used all of the SVM weights as the
weight feature in ID3. These weights for the ID3 attributes are
calculated according to the formula given below.

$$\mathrm{WeightedGR} = W_{\mathrm{SVM},a_i} \cdot \mathrm{GainRatio}(a_i, S)$$

The ID3 tree was implemented in RapidMiner with the weighting
strategy explained above. A second grid search was run to find the
optimum value for the weighted information gain ratio. The range
for this value was set to [10^-3, 10] and searched in 50
logarithmic steps, which resulted in 51 combinations and completed
in 11 hours.
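
The weighting formula above multiplies the gain ratio of an attribute by its SVM weight. A minimal sketch of that calculation is given below, assuming the standard definitions of entropy and gain ratio (information gain divided by split information); the function names are hypothetical and this is not the RapidMiner implementation.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of splitting labels by values, over split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(values)
    # A constant attribute has zero split information and carries no signal.
    return info_gain / split_info if split_info > 0 else 0.0


def weighted_gain_ratio(svm_weight, values, labels):
    """WeightedGR = W_SVM,a_i * GainRatio(a_i, S)."""
    return svm_weight * gain_ratio(values, labels)
```

For a toy attribute that separates the classes perfectly, for example genotypes `["TT", "TT", "CT", "CT"]` against labels `["case", "case", "control", "control"]`, the gain ratio is 1, so the weighted value equals the SVM weight.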

The overall workflow for the data pre-processing, which also 
includes GWAS and integration of phenotype and genotyping 
data, and the Hybrid SVM-Tree model described here is 
summarized in Figure 1. 

[Figure 1 panels: Data Preprocessing; SVM-ID3 Hybrid Model; SVM Model]

Figure 1. Overall Workflow of the SVM-Tree Hybrid Model. The overall workflow starts with data preprocessing, where a representative SNP subset
is formed by PLINK and METU-SNP analysis, phenotype and genotyping data are integrated, and missing values are either eliminated or manually filled by
class mean calculation. After data preprocessing, the integrated dataset is fed into the hybrid model, where the SVM model gives the attribute weights that
are used in ID3.

doi:10.1371/journal.pone.0091404.g001



Results 

In the first phase only the SVM model was run, to present the
classification performance of the stand-alone method on three
different datasets. The first and second sets contained either only
genotyping or only phenotype data, and the third dataset contained
both genotyping and phenotype data. The results of the stand-alone
SVM model are given in Table 2.

The results in Table 2 clearly show that combining phenotypic
information with genotype data slightly increased the decision
performance in all aspects: accuracy, precision, recall and AUC.
The hybrid SVM-ID3 model was then applied to the same three
datasets, and the performance comparison is presented in Table 3.
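
For reference, the accuracy, precision and recall (sensitivity) criteria reported in the tables follow from the confusion matrix of a binary classifier. The sketch below uses the standard definitions; the function name and the example counts are made up for illustration and are not taken from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fn: cases predicted as case/control; tn/fp: controls predicted as
    control/case.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # also called sensitivity
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=90, fp=5, tn=95, fn=10)
```

With these illustrative counts, recall is 90 / (90 + 10) and accuracy is (90 + 95) / 200. AUC is not derivable from a single confusion matrix, since it summarizes performance over all decision thresholds.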



According to the structure of the SVM-ID3 hybrid model, given in
Tree S1, the most important attribute is ethnicity. Our model made
a strict distinction on the ethnicity attribute, which leads to
different decision paths for African American, Latino and Japanese
subjects. For all ethnicities the body mass index (BMI) attribute
is the second descriptive feature of the decision path. For the
African American population, the descriptive phenotypes on
different levels of the tree are the attributes that indicate
smoking and alcohol consumption habits. Surprisingly, the only
phenotypic attribute found for the Japanese population is BMI.
Attributes indicating family history, physical activity, lycopene
intake and smoking behavior are observed for the Latino population.
The overall tree structure of the hybrid model is presented in
Figure 2.

Table 2. Performance comparison of stand-alone SVM model.

Performance Criteria   Only-Genotype Dataset   Only-Phenotype Dataset   Integrated Genotype and Phenotype Dataset
Accuracy (%)           59.02                   68.23                    72.46
Precision (%)          61.29                   76.80                    82.68
Recall (%)             63.15                   70.12                    71.34
AUC                    0.606                   0.768                    0.829

SVM model is tested by three different datasets, only genotype, only phenotype and integrated phenotype and genotype sets. Integrated data set performs best
among others in terms of the performance criteria given.
doi:10.1371/journal.pone.0091404.t002



Some of the prominent decision paths extracted from the tree are
mainly based on ethnicity. For example, if the subject's ethnicity
is African American and the BMI is in the first category
(BMI<22.5), then by looking at rs11729739 our hybrid system can
decide whether the subject is a case or a control. If the allelic
profile for this SNP is TT then the subject is called a case, but
if the subject is heterozygous, carrying CT, then the subject is
called a control. When the results of the hybrid system for the
Japanese population are examined, BMI is also in the first level of
the decision path. Subjects in the fourth branch of BMI (BMI>=30)
are directly classified as cases. For subjects in the first branch
of BMI the decision is made based on the SNP rs2442602: subjects
homozygous for the major allele (AA genotype) are called cases, but
the decisions for subjects carrying other alleles require
investigation of additional SNPs.

The tree structure shows that the decision path for the Latino
population is more complex than those for the Japanese or African
American populations. If the subjects are in the first category of
BMI, then subjects heterozygous for SNP rs17799219, carrying AG,
are called healthy. If the subjects are in the third category of
BMI, which is <29.9, then a second phenotypic attribute, family
history, must be examined. If these subjects have first-degree
relatives with prostate cancer, then SNP rs6475584 is examined to
call whether the subject is a case or not. Many rules like those
given above can be extracted from the tree structure given in
Tree S1.
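
The example rules quoted above can be written as a small rule function. The sketch below encodes only the paths that are stated explicitly in the text and returns `None` everywhere else, since the remaining branches are given only in Tree S1. The function name, the argument layout and the ethnicity labels are assumptions made for this illustration, and the rsIDs are as read from the text.

```python
from typing import Dict, Optional


def classify_by_quoted_rules(ethnicity: str, bmi_category: int,
                             genotypes: Dict[str, str]) -> Optional[str]:
    """Apply only the decision paths quoted in the text; None if undecided."""

    def genotype(rsid):
        # Sort the two alleles so that CT and TC are treated alike.
        raw = genotypes.get(rsid)
        return "".join(sorted(raw.upper())) if raw else None

    if ethnicity == "African American" and bmi_category == 1:
        g = genotype("rs11729739")
        if g == "TT":
            return "case"
        if g == "CT":
            return "control"
    elif ethnicity == "Japanese":
        if bmi_category == 4:  # BMI >= 30
            return "case"
        if bmi_category == 1 and genotype("rs2442602") == "AA":
            return "case"
    elif ethnicity == "Latino":
        if bmi_category == 1 and genotype("rs17799219") == "AG":
            return "control"
    return None  # path not described in the text; see Tree S1
```

A call such as `classify_by_quoted_rules("African American", 1, {"rs11729739": "TC"})` follows the heterozygous branch and yields `"control"`.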

Overall, our hybrid model identified 28 SNPs for the African
American, 22 SNPs for the Japanese and 65 SNPs for the Latino
populations. We investigated the SNPs mapping to genes with the
SNPnexus database [30] and the non-coding SNPs with RegulomeDB
[31] in order to see whether they have been associated with
prostate cancer or any other condition before.

When the SNPs found by the hybrid model were searched in SNPnexus,
107 unique rsIDs matched 62 unique Entrez Gene IDs, and 42 of them
were previously found to be associated with a condition listed in
the Genetic Association of Complex Diseases and Disorders (GAD)
database. A representative set of genes, phenotypes and disease
classes is given in Table 4 and the whole list can be found in
Table S1.

The non-coding SNPs in our final disease model were investigated
with RegulomeDB, which showed that the SNPs found by our hybrid
model have regulatory effects. Table 5 shows the SNPs with a score
lower than 4 from RegulomeDB. The whole list is given in Table S2.

Table 3. Performance comparison of SVM-ID3 Hybrid Model.

Performance Criteria   Only-Genotype Dataset   Only-Phenotype Dataset   Integrated Genotype and Phenotype Dataset
Accuracy (%)           71.67                   84.23                    93.81
Precision (%)          72.69                   86.20                    96.55
Recall (%)             68.96                   83.78                    90.92
AUC                    0.674                   0.857                    0.91

The hybrid SVM-ID3 model is tested on the same datasets: only genotype, only phenotype, and integrated phenotype and genotype sets. The integrated dataset performs
best in terms of the performance criteria given.
doi:10.1371/journal.pone.0091404.t003

Figure 2. Overall tree structure of the hybrid model. The main tree is given in Tree S1 because the structure is too big; this figure
is a small representation of the main tree. The decision starts with ethnicity; African Americans are represented by AA, Japanese by JAP and Latinos by
LAT. For all ethnicities the most descriptive phenotypic attribute is body mass index (BMI). Other phenotypic attributes in the upper levels of the tree
are smoking behavior, family history, lycopene intake and physical activity. The number of SNPs in a node indicates the total number of SNPs
found at different levels on that particular path of the tree.
doi:10.1371/journal.pone.0091404.g002

PLOS ONE | www.plosone.org | March 2014 | Volume 9 | Issue 3 | e91404

Discussion

Here, we have presented a diagnostic disease model for prostate
cancer that utilizes data mining methods and is based on phenotype
and genotyping data. Overall, our results showed that the hybrid
model developed by integrating the SVM and ID3 methods is capable
of using both genotype and phenotype information as input, and has
the best performance for predicting cases vs. controls.

SVM was selected as the first step in our hybrid model because it
is known for its high performance in GWAS [26] and its ability to
classify non-separable problems. The decision logic behind ANNs,
which can also be utilized for GWAS, is not very clear because of
their black box structure. ANNs also have many parameters to
adjust, such as the number of layers, number of nodes in each
layer, number of epochs and learning rate, and most importantly
ANNs have the disadvantage of getting stuck at local minima. SVMs,
on the other hand, have clear decision logic [20] and fewer
parameters, and because of the quadratic problem structure they
offer only one solution, which is the global minimum. As the second
step in our hybrid model, the ID3 decision tree was selected for
its strong performance in classifying discrete-valued datasets such
as those in GWAS. ID3 is easy to construct, performs well on noisy
data with missing values, and is easy to interpret with its visual
features [24]. ID3 is also advantageous over C4.5 and CART trees
because these methods construct trees with pruning, which would
hide some decision paths for the disease, and ID3 is also more
suitable for categorical data.

To the best of our knowledge, there is no similar hybrid or
stand-alone data mining method established as a gold standard for
early diagnosis of prostate cancer. The performance results of the
hybrid model therefore had to be compared to the stand-alone SVM
and ID3 models. The proposed hybrid model had better classification
power than the stand-alone SVM and ID3 models on all three
datasets: only genotyping data, only phenotype data, and the
integrated genotype-phenotype dataset. On the integrated
genotyping-phenotype dataset the hybrid SVM-ID3 model, with 90.92%
sensitivity and 0.910 AUC, outperformed the stand-alone SVM (71.34%
sensitivity and 0.829 AUC) and the stand-alone decision tree
(81.33% sensitivity and 0.732 AUC). Additionally, a three-layer
feed-forward back-propagation ANN was built in RapidMiner and run
on the same combined genotype-phenotype dataset for comparison. The
execution took 3 days to complete, and the accuracy, precision and
recall were all under 55%. The performance of the ANN could be
increased by optimizing its parameters, but this would increase the
execution time even further. Even if the ANN could reach the same
performance as the hybrid model, the long execution time would
remain another big disadvantage besides its being a black box
algorithm.

Overall, our hybrid model was capable of efficiently using the
high-volume, high-dimensional integrated genotyping and phenotype
data as input. Currently, there are many published studies focused
on the analysis of genotyping data, but no example of combining
phenotype data with genotyping profiles has been presented yet.
Filling this gap, genotyping and phenotype data are integrated for
the first time to build a diagnostic disease model for prostate
cancer. As presented in Table 3, integrating the phenotype and
genotype data increased the decision performance in terms of
sensitivity and AUC. The sensitivity of the proposed hybrid model
is 68.69% on a dataset with only genotypes and 83.78% with only
phenotypes, and it increases to 90.92% when genotyping is
integrated with phenotype data. In parallel with the sensitivity,
the AUC also increases: the AUC values for only genotyping data and
only phenotype data are 0.674 and 0.857, respectively, but when
both are used the AUC increases to 0.910.

In addition to its better classification performance, our results
showed that the proposed SVM-ID3 hybrid model was also able to
identify functional and regulatory SNPs related to prostate cancer.
The selected SNPs and their gene-disease relations were checked
using databases such as SNPnexus and RegulomeDB, which integrate
third-party information from different databases and studies in an
SNP-centric format. This means that the SNPs selected to build the
diagnostic disease model with the proposed hybrid method are also
candidates for further biological investigation of the molecular
etiology of prostate cancer.

The proposed hybrid method identified 107 unique SNPs for the
diagnostic model out of the 2710 highly associated SNPs selected
after GWAS. When these 107 SNPs were searched in SNPnexus and
RegulomeDB, some of them were found to be related to specific genes
and others to affect regulation and binding. For example, rs2853668
is known to be associated with CRR9, TERT, which plays an important
role in the regulation of telomerase activity. rs11790106 affects
the regulation of the ATP2B2 gene, which is important for energy
production and calcium transportation in cells; rs12644498 affects
the regulation of the ARL9 gene and rs6887293 affects the
regulation of AGBL4, which are also important for the ATP/GTP cycle
in cells. These genes are closely related to the IGF1 gene, which
plays an important role in insulin metabolism. Many of the genes
that the 107 SNPs in the disease model map to are related to growth
and energy processes. These molecular functions are in fact related
to BMI, which is the most important phenotypic attribute for all
ethnicities found by our hybrid model.

Table 4. SNPnexus results.

Gene        Entrez gene  Phenotype                                                             Disease Class    Pubmed
MCPH1       79648        Adenocarcinoma|Pancreatic Neoplasms                                   CANCER           19690177
MCPH1       79648        breast cancer                                                         CANCER           20508983
SMARCA4     6597         breast cancer                                                         CANCER           19183483
CSMD1       64478        Chromosomal Instability|Cystadenocarcinoma, Serous|Ovarian Neoplasms  CANCER           19383911
CSMD1       64478        Chromosomal Instability|Cystadenocarcinoma, Serous|Ovarian Neoplasms  CANCER           19383911
MTAP        4507         Melanoma|Nevus|Precancerous Conditions|Skin Neoplasms                 CANCER           19578365
MTAP        4507         melanoma|Nevus|Skin Neoplasms                                         CANCER           20574843
MTAP        4507         melanoma|Nevus|Skin Neoplasms|Sunburn                                 CANCER           20647408
MTAP        4507         Precursor Cell Lymphoblastic Leukemia-Lymphoma                        CANCER           [illegible]
ST6GALNAC3  256435       Alcoholism                                                            CHEMDEPENDENCY   20421487
ANGPT2      285          BMI- Edema rosiglitazone or pioglitazone                              PHARMACOGENOMIC  18996102
KLF7        8609         Body Weight|Diabetes Mellitus, Type 2|Obesity|Overweight              METABOLIC        19147600
MTAP        4507         diabetes, type 2                                                      METABOLIC        11985785
PACRG       135138       male infertility                                                      REPRODUCTION     19268936
SEMA5B      54437        Tobacco Use Disorder                                                  CHEMDEPENDENCY   20379614

The SNPs found by the hybrid system were searched in SNPnexus. Many of them were found to be associated with specific genes and phenotypes. This table lists some
of the genes that are matched by SNPs. As the disease classes and phenotypes indicate, our findings match the cancer disease class and the phenotypes searched for
prostate cancer, such as body mass index, smoking and drinking habits.
doi:10.1371/journal.pone.0091404.t004



Table 5. High score SNPs from RegulomeDB.

rsID        Score  Hits
rs1433369   2,2    Motifs|Footprinting|IRF, Motifs|PWM|DMRT5, Motifs|Footprinting|DMRT5, Motifs|Footprinting|STAT1, Motifs|PWM|IRF, Motifs|PWM|STAT1, Chromatin_Structure|FAIRE, Chromatin_Structure|DNase-seq, Protein_Binding|ChIP-seq|SMARCB1, Protein_Binding|ChIP-seq|POLR2A
rs11790106  2,2    Motifs|Footprinting|Pax-6, Motifs|PWM|Pax-6, Chromatin_Structure|FAIRE, Chromatin_Structure|DNase-seq, Protein_Binding|ChIP-seq|GATA1, Protein_Binding|ChIP-seq|HNF4A, Protein_Binding|ChIP-seq|HEY1, Protein_Binding|ChIP-seq|EP300, Protein_Binding|ChIP-seq|SMARCC2, Protein_Binding|ChIP-seq|CEBPB, Protein_Binding|ChIP-seq|FOXA2, Protein_Binding|ChIP-seq|NR3C1, Protein_Binding|ChIP-seq|STAT3, Protein_Binding|ChIP-seq|POLR2A, Protein_Binding|ChIP-seq|FOXA1, Protein_Binding|ChIP-seq|SRF, Protein_Binding|ChIP-seq|CDX2
rs6774902   2,2    Motifs|PWM|MAF, Motifs|PWM|c-Ets-1, Motifs|Footprinting|c-Ets-1, Motifs|Footprinting|MAF, Chromatin_Structure|DNase-seq, Protein_Binding|ChIP-seq|RAD21, Protein_Binding|ChIP-seq|CTCF
rs17701543  3,1    Motifs|PWM|CP2, Chromatin_Structure|DNase-seq, Protein_Binding|ChIP-seq|CTCF
rs12644498  3,1    Motifs|PWM|REST, Chromatin_Structure|FAIRE, Chromatin_Structure|DNase-seq, Protein_Binding|ChIP-seq|USF1
rs17375010  4      Chromatin_Structure|DNase-seq, Protein_Binding|ChIP-seq|CTCF
rs10788555  4      Chromatin_Structure|FAIRE, Chromatin_Structure|DNase-seq, Protein_Binding|ChIP-seq|STAT1, Protein_Binding|ChIP-seq|STAT3
rs6887293   4      Chromatin_Structure|FAIRE, Chromatin_Structure|DNase-seq, Protein_Binding|ChIP-seq|FOXA1, Protein_Binding|ChIP-seq|GATA3
rs744346    4      Chromatin_Structure|FAIRE, Chromatin_Structure|DNase-seq, Protein_Binding|ChIP-seq|ELK4
rs4562278   4      Chromatin_Structure|FAIRE, Chromatin_Structure|DNase-seq, Protein_Binding|ChIP-seq|HNF4A

The SNPs found by the hybrid system were searched in RegulomeDB. Many of them were found to affect binding, and this table lists the SNPs with score lower than 4.
doi:10.1371/journal.pone.0091404.t005

The resulting feature set of our hybrid model was examined, and
the phenotypic attribute ethnicity was found to be the attribute
most related to prostate cancer. This result was not surprising
because several works in the literature have already shown a
relation between ethnic features and prostate cancer. Kleinmann's
work shows that the ethnic background of patients plays an
important role in prostate cancer related quality of life [32].
According to Hoffman, the etiology of prostate cancer is highly
dependent on ethnicity, and African Americans have the highest
risk of having prostate cancer [33]. In support of this, our hybrid
model strictly divides the prostate dataset according to ethnicity,
and different paths were observed for each ethnicity.

Although the decision paths for the ethnicities are all different,
at the second level all decision paths indicate the BMI attribute.
BMI is already known for its relations with different types of
cancer such as breast cancer [34] and esophageal cancer [35], and
is also a strong phenotypic attribute for prostate cancer [36]. In
the literature, along with BMI, age and family history, which are
also among the attributes selected by our hybrid model, have been
shown to be important features for the diagnosis of prostate cancer
[36]. The preventive effect of high BMI values beyond 30 kg/m^2 has
been stated previously [36], and interestingly for the Japanese
population we have also observed the same preventive effect of BMI
for morbidly obese cases at the lower levels of the decision path.
Additionally, the other most common phenotypic attributes in the
decision paths, such as family history, smoking habit, physical
activity and lycopene intake, were also previously associated with
prostate cancer [37]. Overall, our results show that the proposed
hybrid model included the previously established phenotypic
attributes for prostate cancer.

Currently the blood Prostate Specific Antigen (PSA) level is the
gold standard for early detection of prostate cancer before biopsy,
with a maximum reported sensitivity of 86% and a specificity of
33%, with AUC 0.67 [23-42]. PSA levels under 4 ng/ml are considered
normal, levels between 4 ng/ml and 10 ng/ml are considered
suspicious, and levels higher than 10 ng/ml are known to be
associated with high risk [38]. The problem with the PSA test is
the determination of the thresholds. The range between 4 ng/ml and
10 ng/ml is a grey area for decision, and while some subjects below
4 ng/ml can have prostate cancer, some above 10 ng/ml can still be
healthy [39]. In addition, the cut-off values also change with the
subject's age [40]. This introduces a serious problem, and as
various studies state, PSA should not be used as an early diagnosis
tool for prostate cancer [41] until its performance is increased in
terms of sensitivity and specificity [42]. Considering its
diagnostic performance of 90.92% sensitivity and 0.91 AUC, the
proposed hybrid model is a potentially good tool for the early
detection of prostate cancer. After validation with pilot studies,
the proposed model, which only requires a buccal swab, would stand
as a good alternative to the blood PSA test.
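
The fixed PSA bands quoted above amount to a simple threshold rule, which makes the grey-area problem easy to see. The sketch below is an illustration only and not a clinical tool: the function name is hypothetical, and assigning the boundary values 4 and 10 ng/ml to the "suspicious" band is an assumption, since the text does not specify how boundaries are handled.

```python
def psa_band(psa_ng_per_ml: float) -> str:
    """Map a blood PSA level (ng/ml) to the band described in the text."""
    if psa_ng_per_ml < 0:
        raise ValueError("PSA level cannot be negative")
    if psa_ng_per_ml < 4.0:
        return "normal"
    if psa_ng_per_ml <= 10.0:  # boundary handling is an assumption
        return "suspicious"
    return "high risk"
```

A rule of this kind depends on a single measurement and fixed cut-offs, which is why subjects below 4 ng/ml can still have prostate cancer and why age-dependent cut-offs complicate its use.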

Here, for the first time, we have proposed a predictive disease
model integrating genotyping and phenotype data through a hybrid
feature selection that combines two non-parametric data mining
methods, SVM and ID3. Unlike many works in the literature, in this
study we have used both methods individually rather than just
optimizing the main method. Prostate cancer data is used as a case
study, and we have demonstrated that the model combining genotype
information with phenotypes yields better performance than using
only genotype or only phenotype data in disease diagnosis, while
also exceeding the performance of the prostate specific antigen
(PSA) screening test [23].



Conclusions 

In this study, genotyping and phenotype data are integrated for the
first time and a hybrid SVM-ID3 model for prostate cancer is built.
An important contribution of this work is the integration of
genotyping with phenotype data. The effect of this integration is
tested in both the stand-alone SVM and the SVM-ID3 hybrid model. In
terms of performance measures such as sensitivity and AUC, the
integrated dataset outperformed the datasets with only genotype and
with only phenotype data in both models. The sensitivity and AUC of
the integrated dataset for the stand-alone SVM were 71.34% and
0.829, respectively. When the same integrated dataset was used in
the hybrid model, sensitivity increased to 90.92% and AUC increased
to 0.91, also outperforming the blood PSA test. The model was able
to identify prostate cancer associated SNPs that map to
cancer-specific genes such as CRR9, TERT, ATP2B2, ARL9 and AGBL4
and/or have regulatory effects. Experimental and clinical
validation of the described associations for prostate cancer can
lead us to a better understanding of the progression of the disease
at the molecular level. Additionally, the descriptive phenotypes
selected by the hybrid model had also been identified in previous
studies for their relations with prostate cancer. Ethnicity was
observed to be at the root of the decision tree structure, whereas
BMI, family history and smoking were the other phenotypes at the
top levels of the decision model. Overall, our study showed that
the predictive disease model built with the hybrid SVM-ID3 approach
based on genotyping and phenotype data provides a promising tool
for early detection of prostate cancer. After validation of the
proposed model with pilot studies, it can be implemented as a
clinical decision support module to evaluate patients' risk of
developing prostate cancer, and the lifestyle-related phenotypes
(BMI, exercise, smoking, etc.) that have a high impact on a
patient's risk can be identified for each individual to be
monitored in upcoming visits.

Further studies on the proposed hybrid SVM-ID3 method and other
data mining approaches for the integrative analysis of GWAS results
and phenotypic information would aid the development of other
successful disease models, which would accelerate the translation
of variant-disease association findings into the clinical setting
for the development of new decision support tools and personalized
medicine approaches.